Multimodal Context for Natural Question and Response Generation
Abstract
The popularity of image sharing on social media reflects the important role visual context plays in everyday conversation. In this paper, we present a novel task, Image-Grounded Conversations (IGC), in which natural-sounding conversations are generated about shared photographic images. We investigate this task using training data derived from image-grounded conversations on social media and introduce a new dataset of crowd-sourced conversations for benchmarking progress. Experiments using deep neural network models trained on social media data show that the combination of visual and textual context can enhance the quality of generated conversational turns. In human evaluation, a gap between human performance and that of both neural and retrieval architectures suggests that IGC presents an interesting challenge for vision and language research.
Related Papers
Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation
The popularity of image sharing on social media and the engagement it creates between users reflect the important role that visual context plays in everyday conversations. We present a novel task, Image-Grounded Conversations (IGC), in which natural-sounding conversations are generated about a shared image. To benchmark progress, we introduce a new multiple-reference dataset of crowd-sourced, eve...
Evaluating Questions in Context
We present an evaluation methodology and a system for ranking questions within the context of a multimodal tutorial dialogue. Such a framework has applications for automatic question selection and generation in intelligent tutoring systems. To create this ranking system we manually author candidate questions for specific points in a dialogue and have raters assign scores to these questions. To ...
Answering Questions about Moving Objects in Surveillance Videos
Current question answering systems succeed in many respects regarding questions about textual documents. However, information exists in other media, which provides both opportunities and challenges for question answering. We present results in extending question answering capabilities to video footage captured in a surveillance setting. Our prototype system, called Spot, can answer questions ab...
Answering Questions About Moving Objects in Videos
Current question answering systems succeed in many respects regarding questions about textual documents. However, information exists in other media, which provides both opportunities and challenges for question answering. We describe our efforts in extending question answering capabilities to video data: our implemented prototype, Spot, can answer questions about moving objects in a surveillanc...
Plan-Based Integration of Natural Language and Graphics Generation
W. Wahlster, E. André, W. Finkler, H.-J. Profitlich and T. Rist, Plan-based integration of natural language and graphics generation, Artificial Intelligence 63 (1993) 387-427. Multimodal interfaces combining natural language and graphics take advantage of both the individual strength of each communication mode and the fact that several modes can be employed in parallel. The central claim of thi...